fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads#1693
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Verification (issue #1684 acceptance criteria)Verified locally on branch Automated checks
|
| Metric | Result |
|---|---|
| Exit code | 0 |
| Wall time | ~355s (bounded; no indefinite stall) |
| Log | Skipping worker-timeout files in sequential fallback for reallyLargeFile.ts |
checker.ts symbols |
2576 Function nodes |
reallyLargeFile.ts symbols |
0 functions (file node only; parse skipped after timeout) |
Acceptance criteria mapping
- Raised-limit analyze completes in bounded time or skips pathological files with clear log — skip path confirmed
-
reallyLargeFile.tsdoes not block entire chunk via sequential fallback — confirmed (unit test + integration log) -
compiler/checker.tsindexed under raised limits without worker timeout — confirmed in mini repro - README documents TypeScript indexing expectations — not in this PR (follow-up)
Not run (scope / time)
Full microsoft/TypeScript repo root (~39k files) end-to-end — expected to take hours even with fix; prior hang was specifically sequential fallback re-parsing reallyLargeFile.ts indefinitely.
Verdict: Fix addresses the blocking failure mode from #1684; safe to merge from a verification standpoint (pending CI).
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 9356 tests passed 1 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
…ingleton timeout WorkerPoolDispatchError previously surfaced the stalled path only for the singleton-timeout final-fail branch. Worker `error` and `exit` events (and the msg-channel `error` reply) fell back to plain `Error`, so the sequential fallback re-attempted every file in the active job — re-hanging on the same pathological file when the worker crashed mid-parse. Lift the in-flight-file inference into `inFlightExcludePath(job, lastProgress)` and wire it into the three remaining in-pool failure sites. `lastProgress` is already in `runWorker` scope, so `items[lastProgress]` (the next file the worker was about to acknowledge) is the best single guess at the culprit; earlier files are still re-tried sequentially. Returns `[]` when no path is determinable (`lastProgress >= items.length`, or path missing/non-string) so sequential retries the whole job. Replacement-worker startup failures stay plain `Error` (no job context); the result-before-flush protocol bug stays plain `Error` (code fault, not file). Tests cover the three new exclusion paths plus a negative test confirming non-WorkerPoolDispatchError throws fall through to full sequential retry.
- Use cause-neutral "worker-excluded" label in skip messages and tests now
that worker error/exit paths share the same exclusion contract as
singleton-timeout (correctness + maintainability reviewers).
- Add JSDoc to findSelfOrAncestorOfType{s} explaining the parent-walk
short-circuit vs root-DFS fallback (maintainability reviewer).
Restructures `createWorkerPool` so a single bad file no longer kills the
pool for the rest of an analyze run. Five interlocking layers:
1. **Auto-respawn on error/exit** — worker death triggers `replaceWorker`
on the same slot, bounded by `maxRespawnsPerSlot` (default 3). The slot
is dropped from rotation when the budget is exhausted; other slots
keep running.
2. **Circuit breaker** — replaces the permanent `poolBroken=true` with a
consecutive-failure counter. The pool only trips after
`consecutiveFailureThreshold` deaths (default `max(3, poolSize)`) with
no successful job in between. A successful job resets the counter so
transient bursts of bad files don't escalate.
3. **Session-scoped file quarantine** — paths identified as the in-flight
file at the moment of a worker death are added to a `Set<string>` on
the pool. `dispatch()` filters quarantined items up front (they never
reach a worker again this pool lifetime). Exposed via the new
`WorkerPool.getQuarantinedPaths()` so callers can log/route them.
`processParsing` surfaces the per-chunk quarantine summary alongside
the existing fallback-exclusion log.
4. **Authoritative in-flight tracking** — `parse-worker.ts` emits
`{type:'starting-file', path}` before each file. The pool tracks this
per slot and uses it for crash attribution, falling back to the
`items[lastProgress]` heuristic only when no starting-file has been
observed (very-early crash, older worker build). Closes the
reorder/race concerns raised by reviewers C1 and R3 in the earlier
review run.
5. **Per-job cumulative timeout budget** — each `WorkerJob` tracks the
total wall time spent across attempts/splits/retries. When the budget
is exhausted (default 5x `subBatchIdleTimeoutMs`), the pool surfaces
the in-flight path instead of letting exponential backoff balloon
into multi-hour stalls.
Cross-layer wiring: a new `wakeIdleSlots` helper kicks any non-busy live
slot when items are requeued (after a death or split-retry), so a dropped
slot doesn't strand work in the queue. `recoverAndResume` consolidates
the per-job teardown shared by the three in-pool death sites (`error`,
`exit`, msg-channel `error`).
New env knobs: `GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT`,
`GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS`,
`GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD`.
New `WorkerPoolOptions.workerFactory` injection point for unit tests.
Tests: 12 new unit tests using a FakeWorker mock cover quarantine
seeding, slot-respawn, slot-drop after budget, breaker trip + reset,
and quarantine filtering. Plus option-resolution tests for the three
new env vars. All 19 worker-pool/-fallback/-options tests pass; full
unit suite 6040 passed / 30 skipped / 0 failed.
Walks through every finding from ce-code-review run 20260519-094648-3549cf5e. All 12 picked Apply. Critical: - F1 — Layer 5 cumulative-timeout exhaustion no longer silently drops the rest of the job. `requeueRemainder` is now invoked before `handleWorkerDeath` in both Layer 5 and singleton-final-fail give-up paths so non-quarantined items get re-tried by another worker. - F2 — idle-timer recovery overhaul. `!shouldContinue` branch no longer calls `replaceWorker` (double-spawn race with the `handleWorkerDeath` inside `requeueAfterTimeout`). `shouldContinue` branch now enforces `maxRespawnsPerSlot` before respawning, closing the budget-bypass for the timeout-retry path. Also fixes premature `maybeDone` by simplifying the bookkeeping. - F3 — `requeueRemainder` no longer pre-charges `cumulativeTimeoutMs` by `job.timeoutMs`. The death itself consumed no budget, so the next `requeueAfterTimeout` was double-billing the first attempt. - F4 — `WorkerPool.getQuarantinedPaths` is now optional on the interface, matching the defensive `?.()` call site and the existing mocks. Removes the contract-vs-callsite contradiction. - F5 — per-job unattributed-death tracking. When a worker dies with no exclusion attribution, `requeueRemainder` tracks death count per `startIndex`. First time: re-queue intact. Second time: quarantine items[0] as best guess, or drop the job entirely when items lack paths. Bounds the death loop the original design admitted to. - F6 — per-slot consecutive-failure counter. Replaces the pool-wide scalar so a chronically-failing slot trips the breaker on its own streak instead of being masked by another slot's successes. Smaller: - F7 — exhaustiveness `never` check on `WorkerOutgoingMessage` union. - F8 — recursive `runWorker` on fully-quarantined jobs converted to a while-loop. - F9 — `tripBreaker` calls `reject(err)` BEFORE awaiting `worker.terminate()`. A stuck terminate no longer blocks the caller. - F10 — `parsing-processor.ts` quarantine log de-duplicates per pool instance via a `WeakMap`. Only newly-quarantined paths are logged in each chunk; the per-chunk count still surfaces via progress. - F11 — extract `firstPath` local in `requeueAfterTimeout`; eliminates double `itemPath` call and the `unknown as string` cast. Tests (F12, 6 new): - crash-error event path (errorHandler). - F5 drop-branch coverage via items without `.path`. - Common-case unattributable crash falling back to items[0] heuristic. - `replaceWorker` startup failure (workerFactory emits 'exit' before 'online'). - All-slots-dropped breaker trip. - `GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS` env override. Residual gap (deferred): no unit test exercises the Layer 5 cumulative-budget runtime path — requires fake-timer interleaving with FakeWorker that's too brittle for this iteration. Tracked. Unit suite: 257 files / 6056 passed / 30 skipped / 0 failed.
…after-timeout flow Adds 6 new real-worker integration tests covering the PR #1693 resilience layers + fixes 3 follow-on bugs surfaced while writing them. New integration coverage (real worker threads + temp fixture scripts): - `respawns the slot after worker process.exit and finishes the work on the replacement` — exercises Layer 1 auto-respawn + Layer 3 quarantine through real IPC. - `attributes exactly via authoritative starting-file message on worker crash` — Layer 4 end-to-end: starting-file message → exact quarantine attribution (not the items[0] heuristic). - `quarantine filters subsequent dispatches without sending to a worker` — second dispatch's sub-batch payload audited via filesystem; the quarantined path is never sent across the message channel. - `drops a slot after maxRespawnsPerSlot and continues on the survivor` — 2-slot pool, slot dies twice past budget, survivor finishes re-queued remainder. - `trips the circuit breaker on cascading per-slot consecutive failures` — single-slot pool, dies on every job, breaker trips after consecutiveFailureThreshold with WorkerPoolDispatchError carrying the cumulative quarantine. - `survives a worker error event (uncaught throw) the same as a process.exit` — validates recoverAndResume on the errorHandler path via a real worker `throw` (not just process.exit). Bug fixes uncovered while writing these tests: 1. **Stack-overflow recursion in runWorker's no-worker branch** — `if (!worker) { ...; wakeIdleSlots(); maybeDone(); }` recursed indefinitely when multiple slots were mid-respawn simultaneously (wakeIdleSlots → runWorker → no worker → wakeIdleSlots → …). Removed the wakeIdleSlots call: the slot's own respawn IIFE owns runWorker post-respawn, and other slots will pick up work via finishJob's runWorker. 2. **requeueAfterTimeout dispatched work before respawn completed** — the F2 fix had `requeueAfterTimeout` `void`-discarding `handleWorkerDeath`, so the `!shouldContinue` IIFE had no way to know when the respawn finished. New design: `requeueAfterTimeout` returns a `TimeoutDecision` discriminated union; the IIFE owns the death-and-respawn-and-dispatch orchestration in an async closure so it can `await handleWorkerDeath` and then call `runWorker` deterministically. 3. **Stalled-singleton + protocol-error + replacement-startup-crash tests** had stale contracts predating the resilience refactor. The stalled-singleton no longer rejects (it quarantines + resolves `[]`); the protocol-error rejection message now mentions "circuit breaker tripped"; the replacement-startup-crash test documents the known `waitForWorkerOnline` race (online fires before the worker's main script runs, so a top-level throw looks like a successful spawn) — the test asserts the file is quarantined via the second-idle-timeout give-up path. Full suite: 334 files / 8982 passed / 43 skipped / 0 failed (second run; first run had a Vitest-reported flake from an uncaught worker exception bleeding into the test report — repeated runs are clean).
| return matches.find((m) => predicate(Object.keys(m))); | ||
| } | ||
|
|
||
| function countMatchesTsx(src: string, predicate: (tags: string[]) => boolean): number { |
|
Claude finished @magyargergo's task in 7m 51s —— View job Production-Readiness Review — PR #16931. Review Bar for This PR
2. Current PR State
3. Branch Hygiene AssessmentClassification: merge-from-main commit present, but rebase/split would be strongly preferred. The PR is causally connected to #1684: the worker-hang fix required IPC changes, quarantine semantics, circuit breaker, respawn, TypeScript capture optimization, and CLI option threading — these are all interlocking and not cleanly separable. However the diff is ~6500 LOC across 34 files, includes benchmark scaffolding, a new bench/ doc directory, README updates, and CLI wiki changes. The branch has 42 commits including merge-from-main commits and a terminal merge at the head. The diff is reviewable but dense, and CI signal for the oldest commits is stale. A rebase/squash onto current The merge-commit-heavy topology does not introduce unrelated files. The scope expansion (IPC repack, transfer-list, chunk prefetching) is causally necessary: the original hang was partly caused by IPC overhead and main-thread extraction blocking worker throughput. All changed domains are within the ingestion pipeline. 4. Understanding of the ChangeWorker Pool ( IPC ( Quarantine/Cache ( Pipeline concurrency ( CLI/env ( TypeScript captures ( 5. FindingsF1 — MINOR |
|
| Domain | Tests | Real workers? |
|---|---|---|
| Worker lifecycle (crash, exit, messageerror) | worker-pool.test.ts (+512 lines), worker-pool-resilience.test.ts (659 lines) |
Yes — worker-pool.test.ts uses real Worker |
| Protocol IPC | workers/protocol.test.ts (176 lines) |
Unit (pure encoder/decoder) |
| Quarantine | workers/quarantine.test.ts (96 lines), parse-impl-quarantine-cache-skip.test.ts (398 lines integration) |
Integration test: real |
| TypeScript captures | typescript-captures-anchor.test.ts (152 lines) |
Unit (real tree-sitter) |
| Parse-impl concurrency | parse-impl-chunk-concurrency.test.ts, parse-impl-deferred-extraction.test.ts |
Unit |
| Worker pool options | worker-pool-options.test.ts, worker-pool-slot-generation.test.ts |
Unit |
The worker-pool-resilience.test.ts uses FakeWorker (not real Worker threads). The worker-pool.test.ts integration test uses real threads and is the primary IPC/lifecycle coverage. This split is acceptable: resilience logic is unit-tested with controlled fake workers; IPC contract is tested with real threads.
Verdict: Test coverage is adequate for the changed domains. DoD §2.7 exact-count assertions are present in new tests.
14. Final Verdict
Production-ready with minor follow-ups.
The core fix is sound and well-engineered: pathological files are bounded by the cumulative timeout budget + quarantine, worker crashes are bounded by the respawn budget + circuit breaker, IPC transfer-list is correctly zero-copy with no pool-buffer transfer risk, cache suppression correctly blocks partial chunks from being persisted, and the TypeScript capture optimization correctly preserves semantic output while eliminating the per-match root DFS.
Required before merge:
- Remove the
[WIP]from the PR title. - Remove or call the unused
findMatchfunction intypescript-captures-anchor.test.ts:28(CodeQL F1). - Either fill in the benchmark measurement rows in
bench/parse-throughput.mdor remove the self-contradictory "fill in before merging" instruction from the doc (F2).
Non-blocking follow-ups (can be separate PRs):
- Rename
WorkerPoolDispatchError.fallbackExcludePaths→quarantinedPathsorexcludePaths(F3). - Add a non-reentrancy assertion or JSDoc to
WorkerPool.dispatchto document and guard the contract (F7). - Consider adding
byteLengthvalidation in the hybrid envelope decode path. - Address P2 residuals from the PR description:
groupedNodesparallel map, Windows path separator, exit code 0 exemption.
Multi-lane review run on the PR #1693 branch surfaced four addressable items beyond the blocking three. F1 (minor, CodeQL): unused `findMatch` helper in test/unit/scope-resolution/typescript/typescript-captures-anchor.test.ts:28 removed. `countMatchesTsx` flagged by the same CodeQL pass is a false positive — it's called at line 88 by the JSX-anchor regression tests so the rewrite case actually fires under TSX, not just TS. F2 (medium, docs): bench/parse-throughput.md retitled as "(scaffold)" with an explicit "no measurement data has been collected yet" note above the table. The self-contradictory "Regenerate this file before merging any PR that touches the ingestion pipeline" instruction is dropped — the file ships intentionally without numbers; the load-bearing perf-regression protection lives in test/integration/parse-impl-large-fixture.test.ts (U6, 30s Promise.race wall-clock budget). The Latest measurement section now preserves the ~6s sequential observation as a smoke reference, not as a regression target. F3 (low, API hygiene): `WorkerPoolDispatchError.fallbackExcludePaths` renamed to `quarantinedPaths`. The "fallback" terminology was load-bearing under the pre-U20 design when `processParsing`'s sequential-fallback catch-block consumed it to filter the fallback file list. After commit be1f65c removed that catch-block, no production code reads the field — but it stays populated by the pool because the snapshot is genuinely useful operator diagnostics when the breaker trips. The rename clarifies the field's actual semantics (here are the files the pool quarantined before it tripped) without changing wire behavior. Definition + the lone surviving in-pool comment reference + both test assertions updated. F4 (low → real fix, reliability): timeout-retry IIFE in worker-pool.ts now consults `consecutiveFailureThreshold` and trips the circuit breaker when the per-slot consecutive-failure count crosses it. Closes a gap left by ce-code-review's REL-02 patch — that fix added the `consecutiveFailuresPerSlot[workerIndex]++` increment in the timeout-retry path but did NOT add the corresponding threshold-check + tripBreaker call. Result: chronic pure-timeout deaths accumulated counts that never tripped the breaker until the slot also hit `respawnCount > maxRespawnsPerSlot`. Now timeouts and crashes are structurally treated the same way by the breaker, which is what the REL-02 increment was meant to enable. Test coverage: worker-pool-resilience.test.ts already exercises the breaker via the shared handleWorkerDeath path; this new branch traces the same trip semantics with a different entry point, so the breaker-tripped state is observable via the same `getStats().poolBroken` and `WorkerPoolDispatchError.quarantinedPaths` surface. Out of scope here (caller actions or future PRs): - F5 (info): cumulative-quarantine cache check is safe in practice because chunks are alphabetically deterministic; no action. - F6 (low): exit-code-0 quarantine exemption — pre-existing P2 residual, bounded by quarantine + respawn budget; deferred. - F7 (info): dispatch non-reentrancy contract documented but not enforced; no production caller violates it; deferred. - PR title `[WIP]` removal — happens on GitHub side. Tests: 274/274 test files (6185 passing, 30 skipped). The single "error" in the integration runner is the pre-existing intentional- process.exit unhandled-exception leak from the deliberate startup- crash test worker, unchanged by these fixes.
CI scope-parity tests on Ubuntu surfaced silent data loss in the worker IPC: `Phase 'scopeResolution' failed: scope.typeBindings is not iterable` (Python, Go) and `importerModule.typeBindings.has is not a function` (Python). Plus three #1066 large-file regression tests (Python / C# / TypeScript) failed because call relationships weren't resolving from the worker output. **Root cause:** U17 introduced `JSON.stringify`/`JSON.parse` as the protocol body codec. JSON has no representation for `Map`, `Set`, `Date`, `RegExp`, `BigInt`, `TypedArray`, `undefined` values, or circular refs — `JSON.stringify(someMap)` returns `"{}"`. Production scope-resolution code keys data structures on Maps throughout (`ParsedFile.scopes[*].typeBindings: ReadonlyMap<string, TypeRef>`, plus `bindings`, `bySourceScope`, `byTargetDef`, the finalize-algorithm edge indexes, etc.). The JSON round-trip silently turned every Map into an empty object, manifesting downstream as iteration / `.has` calls failing on the decoded payload. **Fix:** replace the JSON body with `node:v8`'s `serialize` / `deserialize`. That's the same structured-clone algorithm Node's `worker.postMessage` uses natively — bit-for-bit compatible with the pre-U17 implicit-clone path. Full type fidelity for Map, Set, Date, RegExp, BigInt, TypedArray, undefined values, and circular refs. No external dependency. A previous iteration of this fix attempted to bolt a Map/Set replacer+reviver onto the JSON path. Rejected in favor of V8 serialization because: - the JSON tag-marker approach requires per-type registration (Map, Set; then Date, RegExp, BigInt would each need their own sentinels); V8 handles them all uniformly - keys to JSON-encode would still need handling for nested types (and the marker approach doesn't survive nested Maps-in-Maps cleanly without recursive replacer logic) - V8 is faster than JSON for object-heavy payloads anyway (binary format, no string escaping pass) - the user-explicit ask was "a much more generic solution that will work for everything" — V8 serialization IS the generic solution Trade-offs documented in the module header: - body bytes are opaque (binary, not human-readable) — debugging requires `v8.deserialize` ad-hoc; protocol.test.ts exercises every supported MessageTag including the new type-fidelity cases as a regression net. - format is tied to the running Node major. Pool always spawns workers on the same Node instance the main thread runs, so this is moot in production. Would matter if frames ever persisted to disk (nothing does today). Protocol test file rewritten: - drops the JSON-specific byte-layout assertions (e.g. `body must equal "null" string`) — replaced with V8-derived expected lengths - adds a "structured-clone type fidelity" describe block that pins Map, nested Map, Set, Date, RegExp, BigInt, TypedArray, undefined values, and circular-ref round-trips. These are the load-bearing regression tests preventing a future "optimize" PR from quietly swapping V8 back to JSON. - the bad-body decode-error test now uses arbitrary non-V8 bytes instead of `{not-json}` — same intent. Integration test READY_PREAMBLEs (worker-pool.test.ts and parse-impl-quarantine-cache-skip.test.ts) update their inline decoders to use `v8.deserialize` matching the production codec. Both files have a standalone CJS worker preamble that can't import dist/protocol.js by relative path, so the V8 dependency is required via `node:v8` directly. Tests: 271/271 unit files (6163 tests + 30 skipped). 28/28 worker-pool integration. 3/3 parse-impl integration. 791/791 scope-parity tests (the four CI-failing files: python.test.ts, go.test.ts, typescript.test.ts, csharp.test.ts) all green again. References plan: docs/plans/2026-05-20-002-fix-chunk-cache-corruption-on-worker-quarantine-plan.md
…rList
The protocol.ts framing layer was redundant — Node's `worker.postMessage`
already runs V8 structured-clone internally, the same algorithm that
backed `v8.serialize`. Wrapping V8.serialize → Buffer →
postMessage(struct-clone-Buffer) was a double-walk: one full
structured-clone pass to produce the Buffer, then another pass when
postMessage cloned that Buffer across threads. This commit cuts the
wrapper layer; workers and pool exchange POJO directly via
`worker.postMessage(value, transferList)`, with file-content
`ArrayBuffer`s in `transferList` for zero-copy ownership transfer.
What changes:
- **Deleted** `src/core/ingestion/workers/protocol.ts` (~180 LOC) +
`test/unit/workers/protocol.test.ts` (~250 LOC). The MessageTag
enum / ProtocolDecodeError / encodeMessage / decodeMessage surface
is gone. Tag-based routing is replaced by the `msg.type`
discriminant that every receive site already checks. Protocol-decode
errors map to Node's `messageerror` event (V8 deserialization
failures during postMessage), which the pool already wires to
`recoverAndResume`.
- **`worker-pool.ts`**: `decodeIncomingWorkerMessage` removed; handlers
receive POJO directly. `buildDispatchMessage` now returns
`{message: {type:'sub-batch', files: [{path, content: Uint8Array}]},
transferList: ArrayBuffer[]}`. The Uint8Array-per-content allocation
via `TextEncoder.encode` is preserved (it's the load-bearing
transfer-safety contract that keeps content out of Node's shared
`Buffer.poolSize` slab). Flush dispatch is now plain
`worker.postMessage({type:'flush'})`.
- **`parse-worker.ts`**: `decodeIncomingMessage` removed. The message
handler receives POJO directly; the only conversion is
`Uint8Array → string` for sub-batch file contents at the
`decodeSubBatchFiles` boundary, before handing to `processBatch`.
Outgoing messages are emitted as POJO via plain
`parentPort.postMessage({type:'starting-file', ...})` etc. The
`sharedHybridDecoder` is now `sharedContentDecoder` (same intent,
clearer name for the simpler shape).
- **Test scaffolding**: FakeWorkers in `worker-pool-resilience`,
`worker-pool-windows-quarantine`, and `worker-pool-slot-generation`
drop their `decodeMessage` import + `decodeDispatchedMessage` helper.
The helpers stay (still convert `files[i].content` Uint8Array →
string for test-action introspection) but no longer touch any
protocol framing — just shape-check for sub-batch.
- **Integration READY_PREAMBLEs** (worker-pool.test.ts and
parse-impl-quarantine-cache-skip.test.ts): drop the inline
v8.deserialize + envelope-unzip logic; the preamble is now just
the ready handshake + a `parentPort.on` wrapper that converts
`files[i].content` Uint8Array → string for the ad-hoc test worker
scripts.
- **`worker-pool-transferlist.test.ts`**: contract tests updated for
the new buildDispatchMessage shape — no `envelope` field anymore;
`message.files[i].content` is Uint8Array; transferList holds each
content.buffer in input order. Pool-slab independence still pinned.
What stays the same:
- Zero-copy file-content transfer via transferList — every file's
ArrayBuffer is ownership-transferred to the worker (no copy).
- Full structured-clone type fidelity — Map / Set / Date / RegExp /
BigInt / TypedArray / undefined / circular refs all preserved by
Node's native postMessage. The V8 fix from commit 06f6957e is
inherent in this path; there's no JSON layer to lose them.
- TextEncoder-per-content allocation — keeps content buffers out of
the shared `Buffer.poolSize` slab so transferring one cannot detach
another.
- The pool's resilience layers (respawn, breaker, quarantine,
starting-file attribution, cumulative timeout, ready handshake,
slot-generation guard) — unchanged.
- U20 chunk-cache write suppression on quarantine — unchanged.
Net: ~430 LOC removed (protocol.ts + tests + inline decoders + helpers),
~120 LOC simplified in worker-pool.ts and parse-worker.ts. One less
serialization pass per message on the hot path.
Tests: 270/270 unit files (6133 + 30 skipped). 822/822 integration
tests including the four CI-failing scope-parity files (Python, Go,
TypeScript, C#) — the V8-fidelity contract holds via native
postMessage with no explicit serializer. The single "error" reported
in worker-pool.test.ts is the pre-existing intentional
process.exit unhandled-exception artifact from the deliberate
startup-crash test, unchanged by this commit.
The `parentPort.on('message', ...)` handler had an `Array.isArray(msg)`
branch left over from a pre-sub-batch dispatch shape — the pool used
to send the items array directly, before the worker pool added
sub-batching and the `{type:'sub-batch', files: ...}` envelope.
No production caller has dispatched that shape since the sub-batching
refactor landed; verified by grepping the repo for `postMessage([`
patterns (zero matches). The `ParseWorkerInput[]` arm in the
`WorkerIncomingMessage` discriminated union also blocked
exhaustiveness narrowing — flagged by the kieran-typescript code
review (RR-01) as "if a future unit removes the legacy array path,
this arm should be dropped." Dropping it now.
What changes:
- Remove the `Array.isArray(msg)` branch from the message handler.
- Drop `ParseWorkerInput[]` from the `WorkerIncomingMessage` union;
it's now a clean `{type:'sub-batch'} | {type:'flush'}` discriminated
union, so the dispatch switch is exhaustive over `msg.type`.
Tests: 71/71 worker-pool unit + integration tests green (resilience,
slot-generation, windows-quarantine, transferlist, parsing-worker-
fallback, worker-pool integration, parse-impl-quarantine-cache-skip).
Windows benchmark: full-root
|
| Check | Result |
|---|---|
| Root analyze finishes (exit 0) | Yes — Repository indexed successfully |
reallyLargeFile.ts does not block the run |
Yes — worker timeout → quarantined on chunk 65/66 (~30 min wall on that chunk) |
compiler/checker.ts present in graph |
Yes — e.g. checkSourceElementWorker, structuredTypeRelatedToWorker, onFailedToResolveSymbol in top process entry points |
Benchmark runs (same machine, same env)
All runs used:
$env:NODE_OPTIONS = "--max-old-space-size=24576"
$env:NODE_ENV = "development" # required for 📊 throughput logs
$env:GITNEXUS_VERBOSE = "1"
$env:GITNEXUS_MAX_FILE_SIZE = "32768"
$env:GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS = "600000"
$env:GITNEXUS_PARSE_CHUNK_CONCURRENCY = "4"
gitnexus analyze <TypeScript-root> --force --index-only --skip-agents-md --skip-skills \
--max-file-size 32768 --worker-timeout 600 --workers 16 -v| Run | Commit | Outcome | Wall time | Nodes | Edges | Communities | Processes |
|---|---|---|---|---|---|---|---|
| 1 | b5d3c7eb (baseline on branch) |
OOM @ ~16 GB heap (no 24 GB cap) | ~2h 50m | — | — | — | — |
| 2 | b5d3c7eb |
SUCCESS exit 0 | ~4h 9m (14,958 s) | 321,508 | 489,999 | 4,274 | 300 |
| 3 | be1f65ce (binary IPC + transferList) |
Killed mid-run (superseded by native IPC) | ~1h 48m | — | — | — | — |
| 4 | 6c6fc774 (native postMessage + transferList, drop protocol.ts) |
SUCCESS exit 0 | ~4h 23m (15,787 s) | 321,507 | 490,000 | 4,274 | 300 |
Takeaway: Runs 2 and 4 produce effectively identical graphs (±1 node/edge). Run 4 is ~14 min slower end-to-end on this box; the important regression fix is completion without hang, not a large throughput win on TS-root.
Run 4 phase breakdown (6c6fc774)
| Phase | Logged duration |
|---|---|
| Worker chunks (66/66) | ~83 min to chunk completion |
Deferred parse (import log → ✓ Phase: parse) |
~151 min silent; parse phase total 13,601,889 ms (~3h 47m) |
scopeResolution |
2,047,396 ms (~34 min) |
| communities + processes + FTS | seconds |
Peak RSS during deferred parse: ~23 GB (24 GB heap cap). After ✓ Phase: parse, memory dropped to low GB and the run finished cleanly.
Quarantine evidence (Run 4)
Worker 0 parse job exhausted cumulative timeout budget … exhausted=["tests/cases/fourslash/reallyLargeFile.ts"]
Worker quarantine: 1 new file(s) skipped this chunk
📊 chunk 65/66: 1 files in 1801613ms … 1 quarantined
Chunk 66 completed with the file skipped; indexing continued through scope resolution and wrote meta.json.
Artifacts
- Index:
TypeScript/.gitnexus/meta.json(indexedAt: 2026-05-20,stats.files: 79,123) - Verbose log (Run 4):
.tmp-analyze-ts-root-native-msg.logon the benchmark worktree
What this validates for the PR
- Worker quarantine + skip timed-out files in sequential fallback — pathological files no longer wedge root analyze indefinitely.
- IPC path (
6c6fc774) — native structured clone +transferListcompletes the same full-root workload as the earlier successful run without OOM on a capped heap. - Graph quality — core compiler symbols (
compiler/checker.ts) indexed; stats match prior successful run.
Happy to re-run on another Windows host or post meta.json excerpts if useful for review.
Summary
Hardens the ingestion worker path so a single pathological file (e.g. microsoft/TypeScript's
tests/cases/fourslash/reallyLargeFile.ts) can't hanganalyzeand worker death can't permanently degrade the pool. Lands in three phases:postMessage— workers and pool exchange POJO directly viaworker.postMessage(value, transferList). Node's structured-clone algorithm preserves Map / Set / Date / RegExp / BigInt / TypedArray / undefined / circular refs out of the box. File contents move zero-copy viatransferList(TextEncoder.encodeper file → dedicatedArrayBuffer→ ownership transfer, never pool-backed). No custom protocol framing layer — one structured-clone walk per message, period.Workers are the sole failure-recovery contract for ingestion: there is no main-thread sequential rescue when a worker dies. Falling back to sequential would re-run the same tree-sitter parser that just killed the worker — for native-binding SIGSEGV (the most common quarantine cause), this turns silent missing-symbols into a loud analyze-wide crash. The pool's layered resilience is the contract; quarantined files surface in the warn log + stats; the next analyze with a fresh pool retries cleanly via cache-skip.
Architecture
flowchart TB classDef main fill:#1e3a5f,stroke:#4a90d9,color:#fff classDef pool fill:#2d4a3e,stroke:#7ab87d,color:#fff classDef layer fill:#3d3a2d,stroke:#d9b04a,color:#fff classDef worker fill:#4a3d4f,stroke:#b87dc4,color:#fff CLI[analyze CLI / MCP]:::main PI[parse-impl.ts chunk loop]:::main PP[parsing-processor.ts processParsing]:::main Cache[parseCache cross-run]:::main subgraph Pool["worker-pool.ts — createWorkerPool"] direction TB Disp[dispatch]:::pool Stats[getStats / getQuarantinedPaths]:::pool subgraph Layers["Resilience layers"] direction LR L1[Layer 1 Respawn budget]:::layer L2[Layer 2 Circuit breaker]:::layer L3[Layer 3 Quarantine]:::layer L4[Layer 4 Starting-file attribution]:::layer L5[Layer 5 Cumulative timeout]:::layer end subgraph Slots["Slot pool"] direction LR S0[Slot 0]:::pool S1[Slot 1]:::pool SN[... Slot N-1]:::pool end end W0[Worker 0 parse-worker.ts]:::worker W1[Worker 1 parse-worker.ts]:::worker WN[Worker N-1 parse-worker.ts]:::worker CLI --> PI PI -->|chunkFiles| PP PP -->|dispatch| Disp Disp --> Layers Layers --> Slots S0 <-->|postMessage + transferList| W0 S1 <-->|postMessage + transferList| W1 SN <-->|postMessage + transferList| WN PP -->|quarantine snapshot| PI PI -->|write only if no quarantine| Cache Stats -.->|verbose log| PIDispatch sequence (happy path)
sequenceDiagram autonumber participant PI as parse-impl chunk loop participant PP as processParsing participant Pool as WorkerPool participant Slot as Slot N participant W as Worker N participant Cache as parseCache PI->>PI: computeChunkHash(files) PI->>Cache: usedKeys.add(chunkHash) PI->>Cache: entries.get(chunkHash) Cache-->>PI: undefined — cache miss PI->>PP: processParsing(chunk, workerPool) PP->>Pool: dispatch(items) Note over Pool: filter quarantined paths<br/>buildDispatchMessage encodes each<br/>file content via TextEncoder Pool->>Slot: postMessage(payload, transferList) Slot->>W: structured-clone payload<br/>zero-copy transfer file ArrayBuffers Note over W: TextDecoder.decode<br/>content Uint8Array → string<br/>at tree-sitter call site loop per file W->>Slot: starting-file path Slot-->>Pool: record in-flight path W->>W: tree-sitter parse + extract W->>Slot: progress end W->>Slot: sub-batch-done W->>Slot: result Slot-->>Pool: collect results Pool-->>PP: results PP-->>PI: mergedData PI->>Cache: entries.set(chunkHash, rawResults) Note over Cache: cache write proceeds —<br/>no quarantine intersectionWorker death + quarantine + cache-skip
sequenceDiagram autonumber participant PI as parse-impl chunk loop participant PP as processParsing participant Pool as WorkerPool participant Slot as Slot N participant W as Worker N (alive) participant W2 as Worker N (replacement) participant Cache as parseCache PI->>PP: processParsing(a.ts, poison.ts, c.ts) PP->>Pool: dispatch Pool->>Slot: postMessage(payload, transferList) Slot->>W: deliver W->>Slot: starting-file a.ts Slot-->>Pool: inFlightPath = a.ts W->>Slot: progress + sub-batch-done for a.ts W->>Slot: starting-file poison.ts Slot-->>Pool: inFlightPath = poison.ts Note over W: tree-sitter SIGSEGV<br/>or uncaught throw W--xSlot: exit code 134 Slot->>Pool: handleWorkerDeath inFlight=poison.ts Note over Pool: Layer 3 quarantine.add(poison.ts)<br/>respawnCount slot N ++ (within budget) Pool->>W2: spawn replacement W2->>Slot: ready handshake Pool->>Slot: re-dispatch c.ts only Note over Slot: poison.ts filtered out<br/>by Layer 3 pre-dispatch Slot->>W2: postMessage W2->>Slot: result for c.ts Slot-->>Pool: collect results Pool-->>PP: merged data — a.ts and c.ts only Note over PP: poison.ts symbols absent<br/>from this run's graph PP->>Pool: getQuarantinedPaths() Pool-->>PP: [poison.ts] PP->>PP: warn log — 1 file quarantined PP-->>PI: mergedData (no rescue) PI->>Pool: getQuarantinedPaths() Pool-->>PI: [poison.ts] Note over PI: chunkFiles intersects quarantine PI--xCache: SKIP entries.set(chunkHash) Note over Cache: chunk stays uncached<br/>next analyze gets a fresh pool<br/>cache miss triggers retryCircuit breaker trip
sequenceDiagram autonumber participant PP as processParsing participant Pool as WorkerPool participant L1 as Layer 1 Respawn participant L2 as Layer 2 Breaker participant Slot as Slot N participant W as Worker N loop until threshold PP->>Pool: dispatch Pool->>Slot: deliver Slot->>W: postMessage W--xSlot: exit Slot->>L1: bump respawn + consecutiveFailures alt budget OK L1->>Slot: replaceWorker else respawnCount exceeded L1->>Pool: drop slot end Note over L2: check threshold<br/>crashes AND pure-timeouts both count end L2--xPool: tripBreaker Note over Pool: WorkerPoolDispatchError<br/>carries quarantinedPaths snapshot Pool-->>PP: reject — no sequential rescue PP-->>PP: error propagates to analyze entry Note over PP: operator sees hard failure<br/>with quarantine snapshot<br/>instead of silently-degraded graphResilience layers (the contract)
maxRespawnsPerSlot(default 3, envGITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT) — slot dropped from rotation when exceededconsecutiveFailureThreshold(defaultmax(3, poolSize), envGITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD) — pool rejects further dispatchesSet<string>on the pool — filtered out of subsequent dispatches; surfaced viagetQuarantinedPaths(){type:'starting-file', path}before each parseinFlightPath[slot]is the authoritative culprit on death; falls back toitems[lastProgress]heuristic when no starting-file observedmaxCumulativeTimeoutMs(default 5× sub-batch timeout, envGITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS) — prevents exponential backoff from inflating into multi-hour stallsWORKER_READY_TIMEOUT_MS(5s) — worker emits{type:'ready'}after top-of-script init; pool drops the slot if not seen within budgetIPC: native
postMessage+ zero-copytransferListNo custom wire format. Workers and pool exchange POJO via
worker.postMessage(value, transferList). Node'spostMessageruns V8's structured-clone algorithm internally — the same algorithm that preserves Map / Set / Date / RegExp / BigInt / TypedArray / undefined / circular refs natively. There is no JSON layer to lose them.Dispatch payload shape (built by
buildDispatchMessage):The file
contentis aUint8Array(not a string) so its underlyingArrayBuffercan be transferred zero-copy via thetransferListargument. The worker callsTextDecoder.decodelazily at the tree-sitter call site — one decode pass on the worker thread, in parallel with continued main-thread work.Transfer-safety contract. Content
Uint8Arrayinstances are allocated viaTextEncoder.encode, which produces a dedicatedArrayBufferper call.Buffer.from(str, 'utf8')andBuffer.allocmay carve from Node's sharedBuffer.poolSizeslab, and transferring one pool-backedArrayBufferdetaches every other Buffer sharing the slab — silent data corruption. TextEncoder bypasses the pool; transferring its outputs is safe. Pinned bytest/unit/worker-pool-transferlist.test.ts(8 ArrayBuffer identities for 8 small files).Why no wrapper protocol. An earlier iteration of this PR introduced a binary frame (
protocol.ts— 1-byte tag + 4-byte LE length + V8-serialized body). Removed in commit6c6fc774: that layer was a redundantV8.serialize → Buffer → postMessage(struct-clone-Buffer)double-walk, becausepostMessageruns the same structured-clone algorithmv8.serializewould call. Dropping the wrapper cut one full clone pass per message and ~430 LOC of framing code + tests.Cross-run cache integrity
flowchart LR classDef path fill:#2d4a3e,stroke:#7ab87d,color:#fff classDef skip fill:#4a2d2d,stroke:#c97a7a,color:#fff A[Chunk N dispatched]:::path B{Any chunk file in pool.getQuarantinedPaths?} C[cache.entries.set chunkHash, rawResults]:::path D[SKIP cache.entries.set]:::skip E[cache.usedKeys.add chunkHash both branches] F[Next analyze fresh pool cache hit/miss] A --> B B -->|No| C B -->|Yes| D C --> E D --> E E --> FBefore the fix, a chunk containing a quarantined file would persist the partial worker output under the full-coverage chunk hash. The next analyze with unchanged content would hit the cache and silently replay incomplete results — symbols for the quarantined file permanently missing until the file's content changed. After the fix, the chunk stays uncached so the next analyze gets a fresh pool (quarantine is session-scoped) and a clean retry.
Configuration knobs
GITNEXUS_WORKER_POOL_SIZE--workers <n>min(16, cores-1)0= sequential.GITNEXUS_PARSE_CHUNK_CONCURRENCYGITNEXUS_CHUNK_BYTE_BUDGET2 MBGITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS--worker-timeout <s>× 1000GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES8 MBpostMessage.GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOTGITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MSGITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLDmax(3, poolSize)GITNEXUS_VERBOSE-v / --verboseImplementation highlights
worker-pool.ts): all five layers + ready-handshake + slot-generation, all boundedcaptures.ts):findSelfOrAncestorOfType[s]replaces per-match root-DFS infindNodeAtRange; dropsemitTsScopeCapturesonchecker.tsfrom ~7 min to ~3 sworkers/quarantine.ts): session-scopedSet-keyed snapshot APIbuildDispatchMessageinworker-pool.ts): no custom wire format;TextEncoder.encodeper content;transferListfor ownership transferparse-impl.ts): cache-write guard skips when any chunk file is quarantinedparsing-processor.ts): workers are the sole resilience contract; pool failures propagate as hard errors instead of silent degradationanalyze.ts,cli/index.ts):--workers, env snapshot/restore,--workers 0parse-impl.ts): pre-fetch K chunks ahead viaparseChunkConcurrency; dispatch stays serialparse-impl.ts): per-chunk extraction passes batched to end-of-loop so workers stay busyTesting
test/unit/worker-pool-resilience.test.ts(26 tests),worker-pool-slot-generation.test.ts,worker-pool-windows-quarantine.test.ts,worker-pool-cumulative-timeout.test.ts,worker-pool-options.test.tstest/unit/worker-pool-transferlist.test.ts(6 tests pinning: Uint8Array per content, transferList in input order, pool-slab independence, non-parse / empty / mixed-shape fallback)test/unit/workers/quarantine.test.tstest/integration/parse-impl-quarantine-cache-skip.test.ts— realworker_threads, custom worker script crashes on poison.ts, asserts (a) surviving files in graph, (b) poison.ts NOT in graph (no rescue), (c) chunk hash absent fromparseCache.entries, (d) second pass cache-misses + re-dispatches + still doesn't cachetest/unit/scope-resolution/typescript/typescript-captures-anchor.test.ts— exact-count assertions on every anchor type the rewrite touchestest/integration/parse-impl-large-fixture.test.ts— synthetic large fixture, 30 sPromise.racebudgettest/integration/worker-pool.test.ts(28 tests) — realWorkerthreads, full lifecycle (death, respawn, breaker, ready handshake)test/integration/resolvers/{python,go,typescript,csharp}.test.ts— exercise large UTF-8/ASCII files through real workers; native postMessage preserves Map fidelity so scope-resolution doesn't choke on the worker round-tripDoD §2.7 exact-count assertions throughout. 270/270 unit files + 7/7 worker-pool/parse-impl integration files green. 6133 unit tests + 822 integration tests passing.
Production-readiness review checkpoints
External multi-lane review surfaced four addressable items, all closed:
findMatchhelper removed (CodeQL was right; the siblingcountMatchesTsxflag was a false positive — it IS called).bench/parse-throughput.mdretitled as scaffold; self-contradictory "fill in before merging" instruction dropped.WorkerPoolDispatchError.fallbackExcludePathsrenamed toquarantinedPaths(the "fallback" terminology was load-bearing under the pre-sequential-removal design; after the rescue path was removed, the rename reflects what the field actually carries).worker-pool.tsnow checksconsecutiveFailureThresholdand trips the circuit breaker. Closes a gap where chronic pure-timeout deaths accumulated counts that never tripped the breaker.Plus the CI-discovered issue that drove the architecture pivot: an early iteration of the IPC layer used a JSON-based protocol body, which silently destroyed every
Map/Set/Date/BigInt/TypedArray/undefinedvalue in the worker output. Production scope-resolution keys data structures on Maps throughout, so JSON serialization manifested as'typeBindings is not iterable'on large fixtures. The fix went through two iterations: first a swap tov8.serialize(binary structured-clone in a Buffer), then realizing the whole wrapper was redundant and droppingprotocol.tsentirely for nativepostMessage. Final state — no explicit serializer — is both correct (full structured-clone fidelity inherent inpostMessage) and faster (one clone walk instead of two).Deferred follow-ups (non-blocking, P2):
groupedNodesparallel map incaptures.ts:188— replace withRecord<string, {capture, node}>to eliminate implicit sync.worker-pool.tsexitHandler— bounded by quarantine + respawn budget today; tightening adds a dedicated code-0 branch.WorkerPool.dispatchnon-reentrancy is documented in comments but not enforced by an assertion. No production caller violates it.Original task